feat: Add NpyCodec for lazy-loading numpy arrays #1331

dimitri-yatsenko · 2026-01-12T22:04:29Z

Summary

Introduces the <npy@> codec for schema-addressed NumPy array storage with lazy loading, and refactors hash-addressed storage to use path-based retrieval.

Key Features

NpyCodec (<npy@>)

Lazy loading: Inspect array shape and dtype without downloading
Memory mapping: Random access to large arrays via mmap_mode
NumPy integration: Transparent array operations via __array__ protocol
Safe bulk fetch: Returns NpyRef objects instead of downloading all arrays
Portable format: Standard .npy files readable by NumPy, MATLAB, etc.
Schema-addressed: Paths derived from primary key ({schema}/{table}/{pk}/{attr}.npy)
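
The lazy-loading idea rests on the .npy layout: shape and dtype live in a small header at the start of the file, so they can be read without touching the array data. The following is a minimal sketch of that idea using NumPy's public `numpy.lib.format` helpers. The function name `peek_npy_header` is hypothetical and is not claimed to be how NpyRef is implemented.

```python
import io

import numpy as np
from numpy.lib import format as npy_format


def peek_npy_header(stream):
    """Return (shape, dtype) from a .npy stream without reading the array data."""
    major, _minor = npy_format.read_magic(stream)
    if major == 1:
        shape, _fortran_order, dtype = npy_format.read_array_header_1_0(stream)
    elif major == 2:
        shape, _fortran_order, dtype = npy_format.read_array_header_2_0(stream)
    else:
        raise ValueError(f"unsupported .npy format version {major}")
    return shape, dtype


# An in-memory buffer stands in for a file in a store.
buffer = io.BytesIO()
np.save(buffer, np.zeros((1000,), dtype=np.float64))
buffer.seek(0)

shape, dtype = peek_npy_header(buffer)
header_bytes_read = buffer.tell()  # only the header has been consumed so far
total_bytes = len(buffer.getvalue())
print(shape, dtype, header_bytes_read, total_bytes)
```

Against a remote store the same approach would need only a ranged read of the first few hundred bytes, which is what makes metadata inspection cheap.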

Hash Registry Refactoring

Path-based retrieval: Full path stored in metadata, used directly for retrieval
Config-change protection: Stored paths guard against subfolding/structure changes
Per-schema isolation: Hash paths include schema name (_hash/{schema}/{hash})

Codec Types

| Codec | Store | Description |
|---|---|---|
| `<blob>` | In-table | DataJoint serialization of Python objects |
| `<blob@store>` | Hash-addressed | Large blobs, deduplicated by hash |
| `<attach@store>` | Hash-addressed | File attachments, deduplicated by hash |
| `<npy@store>` | Schema-addressed | NumPy arrays with lazy loading ← this PR |
| `<object@store>` | Schema-addressed | Python objects, path from primary key |

Plugin codecs (separate packages, coming soon):

<zarr@store> - Zarr arrays
<tiff@store> - TIFF images
<parquet@store> - Parquet tables

Addressing Schemes

| Scheme | Path Derived From | Deduplication |
|---|---|---|
| Hash-addressed | Content hash (MD5→Base32) | Yes (per-schema) |
| Schema-addressed | Primary key | No |
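
As an illustration of hash addressing, the sketch below derives a `_hash/{schema}/{hash}` path from content using MD5 followed by Base32. The function name `hash_path`, the lowercasing, and the stripping of `=` padding are assumptions made for the example. The PR does not spell out those details, and any subfolding is left out.

```python
import base64
import hashlib


def hash_path(schema: str, content: bytes) -> str:
    """Build a per-schema, content-derived path (hypothetical helper)."""
    digest = hashlib.md5(content).digest()  # 16 raw bytes
    # Assumption: lowercase Base32 without padding; the real encoding may differ.
    token = base64.b32encode(digest).decode("ascii").rstrip("=").lower()
    return f"_hash/{schema}/{token}"


same_a = hash_path("lab", b"payload")
same_b = hash_path("lab", b"payload")
other_schema = hash_path("archive", b"payload")
print(same_a)
print(same_a == same_b, same_a == other_schema)
```

Identical content within one schema maps to one path, which is where deduplication comes from. Including the schema name in the path keeps that deduplication per-schema.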

Usage

@schema
class Neuron(dj.Imported):
    definition = """
    -> Session
    neuron_id : int16
    ---
    activity : <npy@store>    # Lazy-loading array
    """

# Fetch returns NpyRef, not the array
ref = (Neuron & key).fetch1('activity')
print(ref.shape)      # (1000,) - no download
print(ref.dtype)      # float64 - no download

# Load when ready
array = ref.load()

# Memory-mapped for large arrays
mapped = ref.load(mmap_mode='r')
chunk = mapped[100:200]  # Only reads the needed portion

Changes

New:

hash_registry.py - Refactored from content_registry.py with path-based storage
SchemaCodec - Abstract base class for schema-addressed codecs
NpyRef - Lazy reference with metadata access
NpyCodec - Codec implementation using .npy format

Refactoring:

ObjectCodec now inherits from SchemaCodec
Renamed is_external → is_store throughout codebase
hash_registry functions use stored paths for retrieval
gc.py updated to work with paths instead of hashes

Test Plan

All 643 tests pass
Unit tests for NpyRef metadata and mmap_mode
Integration tests for roundtrip encode/decode
Integration tests for lazy loading and caching

Documentation

See datajoint-docs (docs-2.0-migration branch):

Co-Authored-By: Claude Code [email protected]

Add migrate_external() and migrate_filepath() to datajoint.migrate module for safe migration of 0.x external storage columns to 2.0 JSON format.

Migration strategy:
1. Add new <column>_v2 columns with JSON type
2. Copy and convert data from old columns
3. User verifies data accessible via DataJoint 2.0
4. Finalize: rename columns (old → _v1, new → original)

This allows 0.x and 2.0 to coexist during migration and provides rollback capability if issues are discovered.

Functions:
- migrate_external(schema, dry_run=True, finalize=False)
- migrate_filepath(schema, dry_run=True, finalize=False)
- _find_external_columns(schema) - detect 0.x external columns
- _find_filepath_columns(schema) - detect 0.x filepath columns

Co-Authored-By: Claude Opus 4.5 <[email protected]>
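
The four-step strategy (add a `_v2` column, copy and convert, verify, then rename) can be sketched with an in-memory SQLite table as a stand-in. This is not the `datajoint.migrate` implementation. The table name, the column name, and the JSON layout are invented for illustration, and SQLite has no real JSON column type, so TEXT is used.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
# Stand-in for a 0.x table whose "data" column holds a legacy external reference.
conn.execute("CREATE TABLE recording (id INTEGER PRIMARY KEY, data TEXT)")
conn.execute("INSERT INTO recording VALUES (1, 'legacy-hash-abc')")

# Step 1: add the new column alongside the old one.
conn.execute("ALTER TABLE recording ADD COLUMN data_v2 TEXT")

# Step 2: copy and convert each value (this JSON layout is an assumption).
for row_id, old in conn.execute("SELECT id, data FROM recording").fetchall():
    conn.execute(
        "UPDATE recording SET data_v2 = ? WHERE id = ?",
        (json.dumps({"legacy_ref": old}), row_id),
    )

# Step 3: verify the converted data while the old column is still untouched.
converted = json.loads(
    conn.execute("SELECT data_v2 FROM recording WHERE id = 1").fetchone()[0]
)

# Step 4: finalize by renaming (old -> _v1, new -> original name).
conn.execute("ALTER TABLE recording RENAME COLUMN data TO data_v1")
conn.execute("ALTER TABLE recording RENAME COLUMN data_v2 TO data")

columns = [row[1] for row in conn.execute("PRAGMA table_info(recording)")]
print(converted, columns)
```

Because the old values survive as `data_v1`, rolling back amounts to reversing the two renames.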

Implement the `<npy@>` codec for schema-addressed numpy array storage:
- Add SchemaCodec base class for path-addressed storage codecs
- Add NpyRef class for lazy array references with metadata
- Add NpyCodec using .npy format with shape/dtype inspection
- Refactor ObjectCodec to inherit from SchemaCodec
- Rename is_external to is_store throughout codebase
- Export SchemaCodec and NpyRef from public API
- Bump version to 2.0.0a17

Key features:
- Lazy loading: inspect shape/dtype without downloading
- NumPy integration via __array__ protocol
- Safe bulk fetch: returns NpyRef objects, not arrays
- Schema-addressed paths: {schema}/{table}/{pk}/{attr}.npy

Co-Authored-By: Claude Opus 4.5 <[email protected]>
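
To show what NumPy integration via the `__array__` protocol means in practice, here is a small sketch of a lazy reference that loads its data only when NumPy asks for an array. The class `LazyArrayRef` and its `loader` callback are hypothetical stand-ins and do not reproduce NpyRef's actual interface.

```python
import numpy as np


class LazyArrayRef:
    """Hypothetical lazy reference: metadata up front, data on first use."""

    def __init__(self, loader, shape, dtype):
        self._loader = loader      # callable that fetches the real array
        self.shape = shape
        self.dtype = np.dtype(dtype)
        self._cache = None
        self.load_count = 0        # counts real fetches, for demonstration

    def load(self):
        if self._cache is None:
            self._cache = self._loader()
            self.load_count += 1
        return self._cache

    def __array__(self, dtype=None, copy=None):
        # NumPy calls this whenever the object is used where an array is expected.
        arr = self.load()
        if dtype is not None:
            arr = arr.astype(dtype, copy=False)
        return arr.copy() if copy else arr


ref = LazyArrayRef(lambda: np.arange(5.0), shape=(5,), dtype="float64")
loads_before = ref.load_count      # metadata is available, nothing fetched yet
as_array = np.asarray(ref)         # triggers the first (and only) fetch
shifted = np.add(ref, 1)           # ufuncs coerce through __array__ as well
loads_after = ref.load_count
print(ref.shape, loads_before, loads_after, shifted)
```

A bulk fetch that returns objects like this stays cheap until an individual reference is actually used in a computation.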

The SchemaCodec (used by NpyCodec and ObjectCodec) needs _schema, _table, _field, and primary key values to construct schema-addressed storage paths. Previously, key=None was passed, resulting in "unknown/unknown" paths. Now builds proper context dict from table metadata and row values, enabling navigable paths like:

{schema}/{table}/objects/{pk_path}/{attribute}.npy

Co-Authored-By: Claude Opus 4.5 <[email protected]>
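
A rough sketch of how such a context dict could be turned into a navigable path follows. The function `schema_path` and the `name=value` encoding of primary-key segments are assumptions for illustration. Only the `_schema`, `_table`, and `_field` keys and the overall path shape come from the commit message.

```python
def schema_path(context: dict, primary_key: dict, extension: str = ".npy") -> str:
    """Build {schema}/{table}/objects/{pk_path}/{attribute}{extension} (hypothetical)."""
    for name in ("_schema", "_table", "_field"):
        if not context.get(name):
            # Fail loudly instead of falling back to an "unknown" placeholder.
            raise ValueError(f"missing required context entry: {name}")
    # Assumption: one path segment per primary-key attribute, in key order.
    pk_path = "/".join(f"{attr}={value}" for attr, value in primary_key.items())
    return (
        f"{context['_schema']}/{context['_table']}/objects/"
        f"{pk_path}/{context['_field']}{extension}"
    )


context = {"_schema": "lab", "_table": "neuron", "_field": "activity"}
path = schema_path(context, {"session_id": 3, "neuron_id": 7})
print(path)  # lab/neuron/objects/session_id=3/neuron_id=7/activity.npy
```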

…to feature/npy-codec

Merge PR #1330 (blob preview display) into feature/npy-codec. Bump version from 2.0.0a17 to 2.0.0a18. Co-Authored-By: Claude Opus 4.5 <[email protected]>

Address reviewer feedback from PR #1330: attr should never be None since field_name comes from heading.names. Raising an error surfaces bugs immediately rather than silently returning a misleading placeholder. Co-Authored-By: Claude Opus 4.5 <[email protected]>

Support memory-mapped loading for large arrays:
- Local filesystem stores: mmap directly, no download
- Remote stores: download to cache, then mmap

Co-Authored-By: Claude Opus 4.5 <[email protected]>
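
The two cases above can be sketched with plain NumPy. `np.load(..., mmap_mode="r")` is the real NumPy call; the helper `open_mapped`, its `is_local` flag, and the use of a file copy as a stand-in for a remote download are assumptions made for this example.

```python
import os
import shutil
import tempfile

import numpy as np


def open_mapped(store_path, *, is_local, cache_dir):
    """Memory-map a .npy file, copying it into a cache first if it is 'remote'."""
    if is_local:
        local_path = store_path  # local store: map the file where it lives
    else:
        local_path = os.path.join(cache_dir, os.path.basename(store_path))
        shutil.copyfile(store_path, local_path)  # stand-in for a download
    return np.load(local_path, mmap_mode="r")


with tempfile.TemporaryDirectory() as store, tempfile.TemporaryDirectory() as cache:
    path = os.path.join(store, "activity.npy")
    np.save(path, np.arange(10_000, dtype=np.float64))

    mapped = open_mapped(path, is_local=False, cache_dir=cache)
    is_memmap = isinstance(mapped, np.memmap)
    chunk = np.array(mapped[1000:2000])  # copies only this slice into memory
    cached_files = os.listdir(cache)
    del mapped  # release the mapping before the directories are removed

print(is_memmap, chunk.shape, chunk[0], cached_files)
```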

…orage

Major changes to hash-addressed storage model:
- Rename content_registry.py → hash_registry.py for clarity
- Always store full path in metadata (protects against config changes)
- Use stored path directly for retrieval (no path regeneration)
- Add delete_path() as primary function, deprecate delete_hash()
- Add get_size() as primary function, deprecate get_hash_size()
- Update gc.py to work with paths instead of hashes
- Update builtin_codecs.py HashCodec to use new API

This design enables seamless migration from v0.14:
- Legacy data keeps old paths in metadata
- New data uses new path structure
- GC compares stored paths against filesystem

Co-Authored-By: Claude Opus 4.5 <[email protected]>

- Remove uuid_from_buffer from hash.py (dead code)
- connection.py now uses hashlib.md5().hexdigest() directly
- Update test_hash.py to test key_hash instead

Co-Authored-By: Claude Opus 4.5 <[email protected]>
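
For context, deriving a cache key with `hashlib.md5().hexdigest()` can look like the sketch below. The function name `query_cache_key` and the choice to hash the `repr` of the query text plus its arguments are assumptions. The commit only states that `connection.py` calls hashlib directly.

```python
import hashlib


def query_cache_key(query: str, args: tuple = ()) -> str:
    """Return a stable hex key for a query and its arguments (hypothetical)."""
    payload = repr((query, args)).encode("utf-8")
    return hashlib.md5(payload).hexdigest()


key_a = query_cache_key("SELECT * FROM neuron WHERE session_id = %s", (3,))
key_b = query_cache_key("SELECT * FROM neuron WHERE session_id = %s", (3,))
key_c = query_cache_key("SELECT * FROM neuron WHERE session_id = %s", (4,))
print(key_a, key_a == key_b, key_a == key_c)
```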

Remove dead code that was only tested but never used in production:
- hash_exists (gc uses set operations on paths)
- delete_hash (gc uses delete_path directly)
- get_size (gc collects sizes during walk)
- get_hash_size (wrapper for get_size)

Remaining API: compute_hash, build_hash_path, get_store_backend, get_store_subfolding, put_hash, get_hash, delete_path

Co-Authored-By: Claude Opus 4.5 <[email protected]>
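
The path-based garbage-collection idea (set operations on paths, with sizes collected during the walk) can be sketched as follows. `find_orphans` is a hypothetical helper operating on a local directory and is not the code in `gc.py`.

```python
import tempfile
from pathlib import Path


def find_orphans(store_root: Path, referenced: set) -> dict:
    """Return {relative_path: size} for files on disk that no metadata references."""
    on_disk = {}
    for entry in store_root.rglob("*"):
        if entry.is_file():
            # Sizes are gathered in the same walk, so no second lookup is needed.
            on_disk[entry.relative_to(store_root).as_posix()] = entry.stat().st_size
    # Plain set difference: everything present minus everything still referenced.
    return {path: on_disk[path] for path in on_disk.keys() - referenced}


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for rel, payload in {"_hash/lab/aaa": b"12345", "_hash/lab/bbb": b"123"}.items():
        target = root / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(payload)

    orphans = find_orphans(root, referenced={"_hash/lab/aaa"})

print(orphans)  # {'_hash/lab/bbb': 3}
```

Because the comparison uses stored paths rather than regenerated ones, legacy and new path layouts can be collected by the same pass.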

github-actions bot added the enhancement and feature labels Jan 12, 2026

dimitri-yatsenko force-pushed the feature/npy-codec branch from 8d7c92e to 08d5c6a January 12, 2026 22:12

dimitri-yatsenko requested a review from d-v-b January 12, 2026 22:13

dimitri-yatsenko self-assigned this Jan 12, 2026

dimitri-yatsenko and others added 7 commits January 12, 2026 16:29

Merge remote-tracking branch 'origin/enhance/blob-preview-display' in…

14d3da6

…to feature/npy-codec

chore: Merge enhance/blob-preview-display and bump to 2.0.0a18

9f6826e

Merge remote-tracking branch 'origin/pre/v2.0' into feature/npy-codec

6b951d4

feat: Add mmap_mode parameter to NpyRef.load()

12ea814

fix: Remove unused variable in mmap test

c02a882

dimitri-yatsenko requested a review from ttngu207 January 13, 2026 00:21

dimitri-yatsenko force-pushed the feature/npy-codec branch from 58e92e9 to 0d1ffe7 January 13, 2026 19:26

dimitri-yatsenko and others added 2 commits January 13, 2026 14:05

refactor: Remove uuid_from_buffer, use hashlib directly for query cache

d2ab4de

dimitri-yatsenko merged commit 471b8a9 into pre/v2.0 Jan 13, 2026
7 of 8 checks passed

dimitri-yatsenko deleted the feature/npy-codec branch January 13, 2026 23:04
